Voice recognition method and system based on the contextual modeling of voice units

ABSTRACT

The method of recognizing speech in an acoustic signal comprises developing acoustic stochastic models of voice units in the form of a set of states of an acoustic signal and using the acoustic models for recognition by a comparison of the signal with predetermined acoustic models obtained via a prior learning process. While developing the acoustic models, the voice units are modeled by means of a first portion of the states independent of adjacent voice units and by means of a second portion of the states dependent on adjacent voice units. The second portion of states dependent on adjacent voice units shares common parameters with a plurality of units sharing the same phonemes.

This application claims priority from PCT/FR2004/000972, filed Apr. 20, 2004, which is hereby incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

The invention deals with the recognition of speech in an audio signal, for example an audio signal spoken by a speaker.

More particularly, the invention relates to an automatic voice recognition method and system based on the use of acoustic models of voice signals, wherein speech is modeled in the form of one or more successions of voice units each corresponding to one or more phonemes.

A particularly interesting application of such a method and system concerns the automatic recognition of speech for voice dictation or in the case of telephone-related interactive voice services.

Various types of modeling can be used in the context of speech recognition. In this respect, reference can be made to the article by Lawrence R. Rabiner entitled “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, volume 77, No. 2, February 1989. This article describes the use of hidden Markov models to model voice sequences. According to such a modeling, a voice unit, for example a phoneme or a word, is represented in the form of a sequence of states, each associated with a probability density modeling a spectral shape that has to be observed on this state and that results from an acoustic analysis. A possible variant of implementation of the Markov models consists in associating the probability densities with the inter-state transitions. This modeling is then used to recognize a spoken speech segment by comparison with the available models associated with known units by the voice recognition system and obtained by a prior learning process.

The modeling of a voice unit is, however, strongly linked to the context in which the voice unit is situated. In practice, a phoneme can be pronounced in different ways depending on the phonemes that surround it.

Thus, for example, the French language words “étroit” and “zéro”, which can be represented phonetically as follows:

“ei t r w a”;

and

“z ei r au”,

contain a phoneme “r”, the sound of which differs because of the sound of the phonemes that surround it.

In order to take account of the influence of the context in which a phoneme is situated, the voice units are normally modeled in the form of triphones which take account of the context in which they are situated, that is, according to the preceding voice unit and the next voice unit. Thus, by considering the words “étroit” and “zéro”, these words can be retranscribed by means of the following triphones:

étroit: &[ei]t ei[t]r t[r]w r[w]a w[a]&

zéro: &[z]ei z[ei]r ei[r]au r[au]&

According to this representation, the “&” sign is used to mark the limits of a word. For example, the triphone ei[t]r denotes a unit modeling the phoneme “t” when the latter appears after the phoneme “ei” and before the phoneme “r”.
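The triphone notation above can be reproduced mechanically. The following is a minimal sketch in Python, given only as an illustration; the function name `to_triphones` and its `boundary` parameter are hypothetical and do not appear in the original text.

```python
def to_triphones(phonemes, boundary="&"):
    """Return one left[center]right label per phoneme of a word.

    The boundary sign marks the limits of the word, as in the text.
    """
    padded = [boundary, *phonemes, boundary]
    return [
        f"{padded[i - 1]}[{padded[i]}]{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# The two example words, given as phoneme sequences.
print(" ".join(to_triphones(["ei", "t", "r", "w", "a"])))
# &[ei]t ei[t]r t[r]w r[w]a w[a]&
print(" ".join(to_triphones(["z", "ei", "r", "au"])))
# &[z]ei z[ei]r ei[r]au r[au]&
```

Each phoneme yields exactly one contextual unit, so the number of distinct units that may have to be modeled grows with the number of possible left and right neighbors.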

Another approach taking account of the context of a phoneme can consist in using voice models with voice units that correspond to a set of phonemes, for example a syllable. According to this approach, the words “étroit” and “zéro” can be represented, using a voice unit corresponding to a syllable, as follows:

étroit: ei t|r|w|a

zéro: z|ei r|au

As can be seen, such approaches require the availability of a large number of models to recognize words or sentences.

The number of units, taking into account contextual influences, depends greatly on the length of the context concerned. If the context is limited to the unit that precedes it and the unit that follows it, the possible number of contextual units is then equal to the number of non-context units to the third power. In the case of the phonemes (36 in French), this gives 36³. In the case of the syllables, the result is N×N×N, with N being in the order of several thousands. In this case, the number of possible voice units increases prohibitively and then requires very great resources in terms of memory and computation capability to implement a reliable voice recognition method.
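The orders of magnitude involved can be checked with a few lines of Python. This is only an illustrative calculation; the value of 3000 syllables is an assumption standing in for “several thousands” and is not a figure given in the text.

```python
def contextual_unit_count(n_units):
    """Number of units when each unit is distinguished by its own identity,
    the preceding unit and the following unit (n * n * n)."""
    return n_units ** 3


print(contextual_unit_count(36))    # phonemes in French: 46656
print(contextual_unit_count(3000))  # assumed syllable inventory: 27000000000
```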

Furthermore, there is not enough learning data available to estimate correctly such a high number of parameters.

The object of the invention is to overcome the above-mentioned drawbacks and to provide a speech recognition method and system that makes it possible to limit considerably the number of parameters needed to model long voice units, namely, voice units corresponding to a syllable or to a series of phonemes.

SUMMARY OF THE INVENTION

The invention thus proposes a method of recognizing speech in an acoustic signal, comprising steps for developing acoustic stochastic models of voice units in the form of a set of states of the acoustic signal and using the acoustic models to recognize the voice signal by comparing this signal with predetermined acoustic models obtained via a prior learning process.

According to a general feature of this method, while the acoustic models are being developed, one or more voice units are modeled by means of a first portion of the states independent of adjacent voice units and by means of a second portion of the states dependent on adjacent voice units, the second portion of states dependent on adjacent voice units sharing common parameters with a plurality of units sharing the same phonemes.

According to another feature of this method, the first portion of states independent of adjacent units corresponds to median states of the voice unit and the second portion of states dependent on adjacent voice units corresponds to start and end states of the voice unit.

The portions independent of adjacent voice units can be specific to a single model.

In an embodiment, the states are each associated with an observation probability density. It is also possible to allow for observation probability densities to be associated with inter-state transitions. The common parameters are then the probability densities.

It is also possible to provide for the second portion of states dependent on adjacent voice units to further comprise at least one transition state which is used to connect states independent of adjacent voice units and has no probability density.

Furthermore, it is possible to provide for the states independent of adjacent voice units to be associated with transitions designed to cause consecutive state skips.

For example, the acoustic models are hidden Markov models.

The invention also proposes a voice recognition system, comprising means of analyzing voice signals for developing a sequence of observation vectors, means for developing an acoustic model of each signal in the form of a set of states of the signal, and means of comparing the acoustic signal with predetermined acoustic models obtained by a prior learning process and stored in a database.

The acoustic models of one or more voice units include a first portion of states independent of adjacent voice units and a second portion of states dependent on adjacent voice units, the second portion of states dependent on adjacent voice units sharing common parameters with a plurality of units sharing the same phonemes.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects, features and advantages of the invention will become apparent from reading the description that follows, given purely as a non-limiting example, and made with reference to the appended drawings in which:

FIG. 1 is a block diagram illustrating the general structure of a voice recognition system according to the invention;

FIG. 2 is a diagram illustrating exemplary modeling of a voice signal;

FIG. 3 illustrates a variant of modeling a voice signal; and

FIG. 4 illustrates another variant of modeling a voice signal.

DESCRIPTION OF PREFERRED EMBODIMENTS

In FIG. 1, the general structure of a speech recognition system according to the invention, denoted by the general numeric reference 10, is represented very diagrammatically.

This system 10 is intended to analyze a voice signal P so as to develop a series of observation vectors, which are then processed to recognize words M, the models of which are known, contained in the signal P. The models are constructed from series of observation vectors so as to characterize the voice units, namely words, phonemes or series of phonemes, with which they are associated.

In the rest of the description, it will be assumed that the modeling consists in developing hidden Markov models. It will, however, be noted that the invention also applies to any other type of modeling appropriate for the envisaged use.

The development of hidden Markov models is a known technique within the scope of those skilled in the art, so it will not be described in detail below. For this, reference can be made to the abovementioned document “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition” by Lawrence R. Rabiner, Proceedings of the IEEE, volume 77, No. 2, February 1989, incorporated by reference, which describes this technique in detail.

It will, however, be noted that the hidden Markov models constitute stochastic models developed so as to describe processes that evolve over time, and that can be in a finite set of states, not directly observable, each sending state being associated with a probability density which models the spectral shapes that are observed on the signal and that result from the acoustic analysis of the signal. In embodiment variants, these observation probability densities can be associated with inter-state transitions. However, in the context of the present description, the term “state” is understood to mean both a state proper and the probability density associated with it, the task of transposing the teaching of the present patent application to an embodiment in which the probability densities are associated with inter-state transitions being well within the scope of those skilled in the art.

As can be seen in FIG. 1, the speech recognition system comprises a first module 12 used to analyze the voice signal P so as to develop a series of observation vectors.

It spectrally or temporally analyzes the signal P, for example, by means of a rolling window, then develops the observation vectors by extracting relevant coefficients. For example, such coefficients are the cepstrum coefficients, also called MFCC coefficients (“Mel-Frequency Cepstrum Coefficients”).
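The rolling-window step can be sketched as follows. This Python fragment is only an illustration of how a signal might be cut into overlapping frames before coefficients are extracted; the function name `frame_signal` and the parameters `frame_len` and `hop` are hypothetical, and the extraction of the cepstrum coefficients themselves is deliberately left out.

```python
def frame_signal(samples, frame_len, hop):
    """Cut a sampled signal into overlapping frames (rolling window).

    frame_len is the window length in samples and hop is the shift between
    two consecutive windows. Trailing samples that do not fill a whole
    window are dropped. One observation vector would then be computed
    from each frame.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


frames = frame_signal(list(range(10)), frame_len=4, hop=2)
print(frames)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```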

In the model construction phase, the duly developed vectors are used to build a model for each unit (word, phoneme or series of phonemes).

As is known per se, an HMM model is characterized by a set of parameters, namely the number of states of the model, the inter-state transition probabilities, and the observation vector sending probability densities.
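The three groups of parameters listed above can be gathered in a small data structure. The following Python sketch is given under stated assumptions: the class names are hypothetical, and the choice of a diagonal Gaussian as observation density is an assumption made for the example, since the text does not specify the family of densities.

```python
import math
from dataclasses import dataclass


@dataclass
class DiagonalGaussian:
    """Assumed form of an observation probability density (not from the text)."""
    mean: list
    var: list

    def log_density(self, x):
        return sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, self.mean, self.var)
        )


@dataclass
class HiddenMarkovModel:
    transitions: list  # transitions[i][j]: probability of going from state i to state j
    densities: list    # one observation density per state

    @property
    def n_states(self):
        return len(self.densities)

    def check(self):
        """Verify that the transition matrix is square and row-stochastic."""
        if len(self.transitions) != self.n_states:
            return False
        return all(
            len(row) == self.n_states and abs(sum(row) - 1.0) < 1e-9
            for row in self.transitions
        )


# Three-state left-to-right model; the 0 -> 2 entry is a state skip.
model = HiddenMarkovModel(
    transitions=[[0.6, 0.3, 0.1], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]],
    densities=[DiagonalGaussian([0.0], [1.0]) for _ in range(3)],
)
print(model.n_states, model.check())  # 3 True
```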

After modeling a word, a second module 14 analyzes the correspondence with a set of models obtained by a prior learning process and extracted from a database 16 so as to identify the candidate word or words.

As indicated previously, this correspondence analysis is performed on the basis of acoustic units, each modeled by an HMM model.

FIG. 2 represents a possible topology of a phoneme model.

According to a feature of the invention, this phoneme model is designed to take account of the context in which the phoneme is situated, that is, the preceding and next phoneme or phonemes or classes of phonemes. As can be seen in this FIG. 2, and as indicated previously, this modeling is based on the development of a set of descriptive states such as E, each associated with an observation probability density Pr. To proceed with this modeling, a set of internal states 18, which are independent of the context of the phoneme or phonemes concerned, and a set of external states 20 and 22, which depend on an adjacent voice unit, are defined. It will, however, be noted that the internal states 18 can also be made dependent on the context in order to increase the accuracy of the model.

The models of the words or expressions to be recognized are obtained by concatenating models of the units (phonemes or series of phonemes) and by connecting them according to the context, by selecting relevant states (E). Thus, the parameters of the external states 20 and 22 take account of the contextual influence of each phoneme. Regarding the internal states 18, the first and last states have a lesser dependence on the context, because their parameters are estimated by using data from all the context-dependent versions of each phoneme.

As is represented in FIG. 3, in which elements identical to those of FIG. 2 are represented by the same numerical references, it is also possible to add, between the external states 20 and 22, on the one hand, and the internal states 18, on the other hand, inert or non-sending states 24 and 26 used mainly to connect the external states 20 and 22 to the internal states 18, in particular when modeling long voice units, but also when modeling phonemes.

FIG. 4 shows another modeling used to model long voice units. In this figure, the external states 28 and 30 correspond to the external states 20 and 22 in FIG. 3. The central states 32 include, in this example, three states for each phoneme, E1, E2, E3 and E′1, E′2 and E′3, for the phonemes “k” and “e” respectively, separated by a connecting state 34, each of these states being associated with an observation probability density. It will be noted that the connecting state 34 could be shared by all the voice units in which a “k” is associated with an “e”.

The modeling illustrated in FIG. 4 can be used to model the voice unit “k|e”. In principle, this voice unit is divided into three parts, namely a left part, a central part and a right part, respectively denoted:

“k_l” “k|e_c” and “e_r”

which respectively correspond to the states 28, 32 and 30. It will be noted that other breakdowns can also be envisaged.

The states 28 and 30 constitute as many inputs and outputs as there are left and right contexts, respectively, for the voice unit concerned. In other words, the number of states 20 and 22 is determined according to left and right contexts, respectively for the phonemes k and e. It will be noted that the relevant contextual states are selected when concatenating the models of the voice units to build the models of the words or the expressions.
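The selection of the relevant contextual states can be pictured with the following Python sketch. It is a simplified illustration only: the dictionary layout, the state labels such as `k_l/a` and `S34`, and the example contexts “a”, “t” and “&” are all hypothetical and chosen for the example.

```python
# Hypothetical description of the unit "k|e" of FIG. 4: one input state per
# left context, the central states, and one output state per right context.
UNIT_K_E = {
    "left": {"a": "k_l/a", "&": "k_l/&"},
    "central": ["E1", "E2", "E3", "S34", "E'1", "E'2", "E'3"],
    "right": {"t": "e_r/t", "&": "e_r/&"},
}


def unit_state_path(unit, left_context, right_context):
    """Pick the input and output states matching the neighbors of the unit
    and return the resulting sequence of states for this occurrence."""
    return [
        unit["left"][left_context],
        *unit["central"],
        unit["right"][right_context],
    ]


# "k|e" preceded by "a" and followed by "t".
print(unit_state_path(UNIT_K_E, "a", "t"))
```

A word model would then be obtained by concatenating the paths returned for each of its units, each unit being given its actual neighbors as contexts.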

In the different embodiments considered, all the voice units that have one and the same first phoneme share a same left model 20, 28, dependent on the context or indeed share only some common parameters, in particular observation densities. Thus, for example, the long voice unit “k|e” and the phoneme “k” share the same model “k_l”, as for any other voice unit that begins with a “k”. Such is also the case concerning the last phoneme. All the voice units that have the same last phoneme share the same right model dependent on the context, in this case “e_r”. The parameters of these contextual models are thus shared between a large number of units, and therefore estimated based on a large number of data items. Only the central part 18, 32 is specific to one voice unit. It is, however, possible to make this central part 18, 32 dependent on the context, to a certain extent, by providing specific state transition paths within its central states tending to skip one or more states, according to the context.
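The sharing of the left and right models can be sketched as a store in which a model part is created once and then reused by every unit that refers to it. The Python fragment below is a minimal illustration under stated assumptions: the class `SharedModelStore`, its methods and the label `k_c` for the central part of the phoneme “k” are hypothetical; only the labels “k_l”, “k|e_c” and “e_r” come from the text.

```python
class SharedModelStore:
    """Keeps one parameter set per model part, keyed by its label."""

    def __init__(self):
        self._parts = {}

    def _part(self, name):
        # Create the parameter set on first use, reuse it afterwards.
        return self._parts.setdefault(name, {"name": name, "densities": []})

    def unit_model(self, unit):
        """Return the left, central and right parts of a voice unit.

        The left part depends only on the first phoneme and the right part
        only on the last phoneme; the central part is specific to the unit.
        """
        phonemes = unit.split("|")
        return {
            "left": self._part(f"{phonemes[0]}_l"),
            "central": self._part(f"{unit}_c"),
            "right": self._part(f"{phonemes[-1]}_r"),
        }


store = SharedModelStore()
long_unit = store.unit_model("k|e")
phoneme = store.unit_model("k")
print(long_unit["left"] is phoneme["left"])        # True: both use "k_l"
print(long_unit["central"] is phoneme["central"])  # False: central parts differ
```

Because the shared parts are the same objects, any data used to estimate “k_l” for one unit benefits every other unit beginning with “k”.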

As can be seen, with the invention according to which contextual parameters are shared between long acoustic units (phonemes, syllables or any series of phonemes), the number of parameters necessary for speech recognition with long acoustic units is considerably reduced.

1. A method of recognizing speech in an acoustic signal, comprising the steps of: modeling at least one voice unit of an acoustic signal, such that at least one voice unit is represented in the form of a stochastic acoustic model comprising a plurality of states with a first portion of states and a second portion of states; and recognizing a voice signal by comparing said voice signal with said stochastic acoustic model obtained via a prior learning process, wherein the step of modeling the at least one voice unit comprises modeling the first portion of states, which corresponds to median states of the voice unit and comprises states which depend on the whole of the voice unit, and modeling the second portion of the states, which corresponds to start and end states of the voice unit and depends on adjacent voice units that share common parameters with a plurality of voice units sharing the same phonemes.

2. The method as claimed in claim 1, wherein the first portion of states is specific to a single model.

3. The method as claimed in claim 1, wherein the states are each associated with a respective observation probability density.

4. The method as claimed in claim 3, wherein said common parameters comprise the probability densities.

5. The method as claimed in claim 3, wherein the second portion of states further comprises at least one transition state used to connect states independent of adjacent voice units and having no probability density.

6. The method as claimed in claim 3, wherein the states associated with the first portion of states are independent of adjacent voice units and associated with transitions designed to cause consecutive state skips.

7. The method as claimed in claim 1, wherein the stochastic acoustic model comprises inter-state transitions which are associated with probability densities.

8. The method as claimed in claim 7, wherein said common parameters comprise the probability densities.

9. The method as claimed in claim 7, wherein the states associated with the second portion of states are dependent on adjacent voice units, and wherein the second portion of states further comprises at least one transition state used to connect states independent of adjacent voice units and having no probability density.

10. The method as claimed in claim 7, wherein the states associated with the first portion of states are independent of adjacent voice units and associated with transitions designed to cause consecutive state skips.

11. The method as claimed in claim 1, wherein the acoustic models comprise hidden Markov models.

12. The method as claimed in claim 1, wherein the first portion of states are partially determined according to the adjacent voice units.

13. The method as claimed in claim 1, wherein the first portion of states are independent of adjacent voice units and the second portion of states are dependent on adjacent voice units.

14. A voice recognition system, comprising: means for analyzing voice signals, having a plurality of voice units, and for developing a sequence of observation vectors; means for developing a stochastic acoustic model for each voice signal in the form of a plurality of states; and means for comparing each of the stochastic acoustic models with a predetermined acoustic model obtained by a prior learning process and stored in a database, wherein at least one stochastic acoustic model includes a first portion of states corresponding to median states of a voice unit and comprising states that depend on the whole of the voice unit, and a second portion of states corresponding to start and end states of the voice unit and comprising states dependent on adjacent voice units that share common parameters with a plurality of voice units sharing the same phonemes.